Robust framework for COVID-19 identification from a multicenter dataset of chest CT scans

The main objective of this study is to develop a robust deep learning-based framework to distinguish COVID-19, Community-Acquired Pneumonia (CAP), and Normal cases based on volumetric chest CT scans, which are acquired in different imaging centers using different scanners and technical settings. We demonstrated that while our proposed model is trained on a relatively small dataset acquired from only one imaging center using a specific scanning protocol, it performs well on heterogeneous test sets obtained by multiple scanners using different technical parameters. We also showed that the model can be updated via an unsupervised approach to cope with the data shift between the train and test sets and enhance the robustness of the model upon receiving a new external dataset from a different center. More specifically, we extracted the subset of the test images for which the model generated a confident prediction and used the extracted subset along with the training set to retrain and update the benchmark model (the model trained on the initial train set). Finally, we adopted an ensemble architecture to aggregate the predictions from multiple versions of the model. For initial training and development purposes, an in-house dataset of 171 COVID-19, 60 CAP, and 76 Normal cases was used, which contained volumetric CT scans acquired from one imaging center using a single scanning protocol and standard radiation dose. To evaluate the model, we collected four different test sets retrospectively to investigate the effects of the shifts in the data characteristics on the model’s performance. Among the test cases, there were CT scans with similar characteristics as the train set as well as noisy low-dose and ultra-low-dose CT scans. In addition, some test CT scans were obtained from patients with a history of cardiovascular diseases or surgeries. This dataset is referred to as the “SPGC-COVID” dataset. 
The entire test dataset used in this study contains 51 COVID-19, 28 CAP, and 51 Normal cases. Experimental results indicate that our proposed framework performs well on all test sets, achieving a total accuracy of 96.15% (95%CI: [91.25–98.74]), COVID-19 sensitivity of 96.08% (95%CI: [86.54–99.5]), CAP sensitivity of 92.86% (95%CI: [76.50–99.19]), and Normal sensitivity of 98.04% (95%CI: [89.55–99.95]), where the confidence intervals are obtained at a significance level of 0.05. The obtained AUC values (One class vs Others) are 0.993 (95%CI: [0.977–1]), 0.989 (95%CI: [0.962–1]), and 0.990 (95%CI: [0.971–1]) for the COVID-19, CAP, and Normal classes, respectively. The experimental results also demonstrate the capability of the proposed unsupervised enhancement approach to improve the performance and robustness of the model when it is evaluated on varied external test sets.


Introduction
Since the emergence of the novel coronavirus disease (COVID-19) and the consequent global pandemic, healthcare authorities have used different diagnostic technologies to rapidly and accurately detect infected cases. Among such diagnostic technologies, chest Computed Tomography (CT) scans have been widely used, providing informative images of the lung parenchyma. More importantly, CT scans are highly sensitive for the diagnosis of COVID-19 infection, particularly based on its specific abnormality pattern and infection distribution in the lung [1]. To analyze a CT scan, radiologists must review several 2D images (slices), which jointly create a 3D representation of the body. Consequently, the analysis of a CT scan requires a careful review of all slices. Furthermore, the lung imaging manifestations of COVID-19 overlap considerably with those of Community-Acquired Pneumonia (CAP), making the diagnosis even more challenging for radiologists. The aforementioned issues have motivated the development of Artificial Intelligence (AI)-based diagnostic solutions using advancements in Deep Learning (DL) to analyze volumetric CT scans and provide diagnostic labels in a timely fashion [2]. Despite the recent surge of interest in and success of DL-based diagnostic solutions, such models commonly fail to achieve acceptable performance when the data characteristics differ between the train and test sets, which is common when data are acquired from multiple imaging centers [3]. It is therefore of utmost importance to develop a robust framework that minimizes the effect of the gap between the train and test sets and provides acceptable results on varied external datasets. In the case of CT scans, several factors contribute to the characteristics of the images, among which the scanner type, the scanner manufacturer, and the scanning protocol have the most influence on the quality and characteristics of the scans [4,5].
Furthermore, the patients' clinical and surgical history can add more complexity and undesired artifacts to the CT scans that the trained model may not have encountered [6].
Capitalizing on the above discussion, this study aims to develop a robust DL-based framework that generalizes to varied external datasets and is flexible enough to update itself upon receiving new external datasets. In this context, the paper introduces an automated two-stage classification framework based on Capsule Networks, which is tailored to robustly classify volumetric chest CT scans into one of the three target classes (COVID-19, CAP, or normal). The proposed Capsule Network-based framework integrates a scalable enhancement approach to boost its performance and robustness in the presence of gaps between the train and test sets regarding types of scanners, imaging protocols, and technical parameters. Furthermore, this paper summarizes the 2021 Signal Processing Grand Challenge (SPGC) on COVID-19 diagnosis (SPGC-COVID challenge), which the authors organized as part of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). In particular, an overview of the top six models [7][8][9][10][11][12] developed in the challenge is provided, and their main components are investigated. In addition, the paper introduces a unique test dataset, referred to as the SPGC-COVID dataset, which is available for public access through Figshare [13]. This dataset was used as the test set of the SPGC-COVID challenge. The SPGC-COVID dataset consists of COVID-19, CAP, and normal cases acquired with various imaging settings from different medical centers. It contains four subsets, illustrated in Fig 1, including images with different slice thicknesses, radiation doses, and noise levels. In addition to different technical parameters, the dataset includes CT scans of patients who have heart diseases or have undergone heart surgery, besides having COVID-19 or CAP infections.
It is worth noting that the labels of the SPGC-COVID dataset were not released during the related competition; participants had access only to the CT images. In this study, however, the associated labels are presented along with a comprehensive description of each test set and a detailed list of the technical parameters used to acquire the dataset. In brief, the main novelties of this study are as follows:
1. Development of a Three-way DL-based Diagnostic Framework: Different from most of the recent State-Of-The-Art (SOTA) solutions (such as [14]), where only COVID and non-COVID cases are classified (binary classification), this study aims to distinguish COVID-19, CAP, and normal cases (i.e., a three-way classification is considered). Due to similarities between COVID-19 and CAP cases, the three-way classification is critically challenging and previously developed models cannot be directly applied.

2. Development of a Clinically Applicable (Generalizable) DL-based Solution: This study aims to address the problem of generalizability of DL-based models, which has been debated for a long time. Within the context of COVID-19, most SOTA DL-based solutions are developed based on single-centre datasets. When independent test sets are not used, DL models are more likely to overfit the training data distribution or to learn dataset-specific artifacts rather than the true characteristics of the disease. In this study, the aforementioned problem is further investigated to take one step forward toward reaching a clinically applicable AI-based approach.

3. Development of a Novel Unsupervised Enhancement Mechanism: To increase the performance of DL-based models upon receiving new datasets from different centers and cohorts, a novel and scalable unsupervised enhancement mechanism is proposed.

4. Introduction of the SPGC-COVID Dataset: The necessity of avoiding overfitting and having an independent test set with images from a different dataset than the training and validation dataset is not sufficiently emphasized in recent DL-based studies. In particular, Hasan et al. [15] have investigated and validated the aforementioned statement. The introduced SPGC-COVID dataset can address this issue as it contains four different subsets, each with specific characteristics in terms of scanning setting, imaging center, and clinical background of the subjects.
The performance of our proposed Capsule Network-based framework is compared with SOTA approaches of the COVID-19 grand challenge. The results demonstrate that our proposed framework outperforms all the submitted models by achieving the overall accuracy of 96.15% (

Materials and methods
In this section, first, the SPGC-COVID dataset is introduced. Then, a brief summary of the 2021 SPGC-COVID challenge is provided, followed by a description of the challenge's best approaches. Finally, the main components of our proposed two-stage Capsule Network-based classification framework are presented, and the introduced unsupervised enhancement approach is described in detail.
This study is conducted following the policy certification number 30013394 of Ethical acceptability for secondary use of medical data approved by Concordia University, Montreal, Canada. Informed consent is obtained from all the patients.

Dataset
In what follows, different datasets used in this study are described individually, followed by supplementary information about the demographic data, imaging protocols, acquisition settings, de-identification, and the labeling process. The utilized dataset consists of a training and a test set, where the training dataset is the COVID-CT-MD [16] we introduced previously and is acquired from one imaging center using similar scanning parameters. The test dataset, the so-called SPGC-COVID, is comprised of four different sets each with specific characteristics to evaluate the robustness and generalizability of the DL model from different aspects.
An overview of different datasets and imaging centers is visualized in Fig 1. Different components of the utilized dataset are as follows:
• Train Set: We used our in-house and publicly available dataset [16], referred to as the "COVID-CT-MD", as the training dataset, which contains CT scans of COVID-19, CAP, and normal cases acquired by the "SIEMENS, SOMATOM Scope" scanner using the standard radiation dose from Babak Imaging Center, Tehran, Iran. A subset of 55 COVID-19 and 25 CAP cases was analyzed by one radiologist (M.J.R.) to identify slices demonstrating infection. The labeled subset of the data contains 4,993 slices demonstrating infection and 18,416 slices without evidence of infection. 30% of the cases in this set are randomly selected as the validation set.
• The SPGC-COVID Test Set: This dataset, which is released through this manuscript, comprises the following four different subsets:
• Test Set 1: Low and Ultra-Low dose CT scans of COVID-19 and normal cases acquired from the same imaging center as that of the train set. This dataset is a subset of our in-house dataset of Low-Dose CT scans [17] and is publicly available.
• Test Set 2: CT scans of COVID-19, CAP, and normal cases acquired in a different imaging center (Tehran Heart Center, Iran) using the "SIEMENS SOMATOM Emotion 16" scanner and different scanning parameters. Some cases in this dataset have an additional history of cardiovascular disease/surgeries with specific CT imaging findings, which are not available in the train set.
• Test Set 3: CT scans of COVID-19, CAP, and normal cases obtained by the same scanner and scanning protocol used to acquire the train set. Cases in this test set are not included in the COVID-CT-MD public dataset.
• Test Set 4: A combination of new CT scans of all three categories (i.e., COVID-19, CAP, Normal) obtained from the same centers as those of Test Sets 1 and 2, using the same acquisition settings and scanners.
Additional statistical and demographic information about the different train and test sets used in this study is provided in Table 1. In Table 1, Center 1 represents the Babak Imaging Center and Center 2 is the Tehran Heart Center. Both imaging centers are located in Tehran, Iran and use the Filtered Back Projection reconstruction method [18] to obtain the CT images. Sample CT slices from the first three test sets are shown in Fig 2. Various scanning protocols and settings have been used to obtain the train and test datasets used in this study. The important parameters that contribute the most to the image quality and characteristics are presented in Table 2.
De-identification. The data used in this study complies with the DICOM supplement 142 (Clinical Trial De-identification Profiles) [19], which ensures that all personal information is removed or obfuscated, including names, UIDs, dates, times, comments, and center-related information. Some demographic and acquisition attributes related to the patients' gender and age, scanner type, and image acquisition settings have been preserved to provide useful information about the dataset.
Labeling process. Diagnosis of the cases scanned in Center 1 is obtained by finding the consensus between three experienced radiologists who have considered the following three main criteria to label the data: (i) Imaging findings including Ground Glass Opacities (GGOs), consolidations, crazy paving pattern, bilateral and multifocal lung involvement, peripheral distribution, and lower lobe predominance of findings; (ii) Clinical symptoms of the COVID-19 infection; and (iii) Epidemiology.
For the cases acquired from Center 2, 13 of the 18 COVID-19 cases have positive RT-PCR test results and the remaining cases have been labeled by one experienced radiologist following the same aforementioned criteria. Slice-level labels are provided by one radiologist to identify and label slices with evidence of infection. A subset of 15 random cases has been further reviewed by the two other radiologists to confirm the accuracy of the slice-level labels.

Summary of the 2021 SPGC-COVID challenge
In the first phase of the SPGC-COVID challenge, participants had access to the same train and validation sets as those used in this study to develop and evaluate their models. In the second phase, they were provided with the first three test sets and had two weeks to submit their final models. Finally, the best-performing models based on the first three test sets were evaluated on the fourth test set to determine the overall performances. In what follows, the main components of the six best-performing models in the SPGC-COVID challenge [7][8][9][10][11][12] are briefly described. As stated previously, an external model [20] developed based on the same dataset, but not as part of the challenge, is also investigated and used for comparison in this study. This model is also summarized below.
• Ref. [7]: In this model, slice-level predictions are acquired from an EfficientNet-based classifier [21] and a weighted majority voting is proposed to obtain the final patient-level labels.
To train this classifier, the authors first trained two separate binary classifiers to detect slices demonstrating infection from COVID-19 and CAP cases. Then, they fed these models with unlabelled cases to provide the training set for the main classifier. Additionally, they only considered the middle slices (e.g., 80 middle slices) in a volumetric CT scan during the training phase.
• Ref. [8]: This model aggregates the output of six classifiers developed based on the 3D ResNet101 model [22]. One model in this proposed framework is a three-way classifier trained over all of the cases while the other five models are binary classifiers independently trained over COVID-19 and CAP cases using different combinations of train and validation sets.
• Ref. [9]: This model presents a feature extraction-based approach in which a modified pretrained ResNet50 model classifies each slice into the target classes and the penultimate fully connected layer is extracted as the feature map. Next, a max-pooling layer followed by two fully connected layers is used to generate patient-level prediction from slice-level feature maps. The output of this model is then aggregated with two BiLSTM patient-level classifiers, which are fed by the same slice-level feature maps to provide the final patient-level labels.
• Ref. [10]: The pre-trained 3D ResNet50 [23] is the backbone of this model. The authors first doubled the number of slices for each case using a 3D cubic interpolation method. Then, they extracted the lung area using a pixel-based segmentation approach, followed by classical image processing techniques such as pixel filling and border cleaning. Finally, a subset of slices is selected from each volumetric CT scan based on their lung area and an experimentally set threshold. The selected slices are then resized into a (224, 224, 224) volume using a 3D cubic interpolation method, providing the patient-level input for training and evaluation purposes.
• Ref. [11]: This model utilizes a two-stage framework in which the first stage performs a multi-task classification that classifies 2D slices into one of the target groups and, at the same time, identifies the location of the slice in the sequence of CT images. The model at the first stage uses an ensemble of four popular CNN-based classifiers (i.e., ResNeXt50 [24], DenseNet161 [25], Inception-V3 [26], and Wide-ResNet [27]), followed by an aggregation mechanism that divides the whole volumetric CT scan into 20 groups of slices and calculates the percentage of infected slices related to COVID-19 and CAP classes in each group. The values obtained for all groups are then concatenated and fed into an XGBoost classifier [28] in the second stage to generate patient-level predictions.
• Ref. [12]: The model proposed in this work begins with a slice-level EfficientNet-B1 classifier [21] that classifies slices and generates feature maps (intermediate layers) to be used in the subsequent sequence classifier. In the sequence classifier, several weak classifiers are trained and the outputs are aggregated using an adaptive weighting mechanism to obtain the final patient-level results. To further enhance the performance of the model and cope with the imbalanced training set, a combination of weak and strong data augmentations is applied to the training cases, forcing the model to produce similar labels for both types of augmented images. Furthermore, to improve the robustness of the model when it is tested on varied datasets, a K-Means clustering method (K = 3) [29] is adopted to develop a single classifier for each cluster of the data and aggregate the results via a majority voting approach.
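To make the group-wise aggregation described for Ref. [11] more concrete, the following Python sketch turns a sequence of slice-level labels into the kind of feature vector that could be passed to a patient-level classifier. It is only an illustration under stated assumptions, not the challenge participants' code: the function name, the label strings, the equal-sized split into groups, and the ordering of the concatenated values are all hypothetical.

```python
def group_infection_features(slice_labels, n_groups=20):
    """Per-group fractions of COVID-19 and CAP slices, concatenated.

    slice_labels: slice-level predictions ordered from first to last slice,
                  each one of "COVID-19", "CAP", or "Normal" (assumed names).
    Returns a flat list of length 2 * n_groups.
    """
    n_slices = len(slice_labels)
    features = []
    for g in range(n_groups):
        # Split the volume into n_groups contiguous, near-equal groups.
        start = g * n_slices // n_groups
        end = (g + 1) * n_slices // n_groups
        group = slice_labels[start:end]
        size = len(group)
        if size == 0:
            # Fewer slices than groups: treat the empty group as infection-free.
            features.extend([0.0, 0.0])
            continue
        features.append(group.count("COVID-19") / size)
        features.append(group.count("CAP") / size)
    return features
```

The resulting vector has a fixed length regardless of how many slices a scan contains, which is what allows scans of different sizes to share one patient-level classifier.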
The following provides a brief description of a model that was not proposed in the SPGC-COVID challenge but is used for comparison in this study, as similar datasets were used for its development and evaluation:
• Ref. [20]: This model aims to introduce a robust training algorithm and classification framework, which is capable of being updated upon receiving new datasets to deal with the characteristic shifts in different test sets. First, it adopts a two-stage architecture similar to the COVID-FACT model proposed in reference [14] and trains the benchmark model in a self-supervised fashion [30], and then majority voting is adopted to obtain patient-level labels. The backbone model used in that study is DenseNet169 [25], and strict slice preprocessing and sampling methods are applied to the training set. Such methods include pixel-based approaches with fixed thresholds used to extract lung areas and select the slices with the most visible lung area. Next, each test set is divided into four quarters, which are then used in an unsupervised updating process, in which quarters are passed to the model sequentially and confident predictions are selected to fine-tune the slice-level classifiers. A slice-level prediction is considered confident in that study if it achieves a probability of at least 0.9 in agreement with the patient-level label.
The key components of our proposed model and those used for comparison are summarized in Table 3. All models resized and normalized the input data to be compatible with the utilized architectures. In addition, most frameworks (except those using a 3D model as their backbone) adopted a multi-stage framework transferring information from the slice-level domain to the patient-level one, some of which also utilized an ensemble architecture to aggregate the extracted information.
Table 3. Underlying key features of the proposed framework and seven models used for comparison in this study (i.e., top six models developed following the SPGC-COVID challenge, and one model developed outside of the scope of this challenge).

Proposed Capsule Network-based framework
In this study, we have developed a two-stage framework similar to the model proposed in our previous study [14], referred to as the "COVID-FACT", as our benchmark model to classify volumetric CT scans into three target classes of COVID-19, CAP, and normal. We then use the unlabeled data from the test sets to boost the performance and robustness of the framework on the unseen cases. The pipeline of the proposed framework is shown in Fig 4. It is worth mentioning that although Capsule Network is the building block of models proposed in this study and our previous study [14], these two frameworks target different challenges, and different experiments are performed in the associated studies.
In what follows, different components of the proposed framework are described:
• Preprocessing: Raw CT scans typically contain uninformative components and unwanted artifacts (e.g., metallic artifacts), which can negatively affect the performance of the DL model. In addition, image sizes may vary and pixel intensities may be in different ranges when the images are acquired by different scanners. As such, we first extracted the lung areas from the CT images to remove the insignificant and distracting components. In this regard, we used a well-trained U-Net-based segmentation model [31], which is fine-tuned on COVID-19 cases, to specify lung areas in the first step. We then down-sampled all images to the (256 × 256) size to reduce the memory allocation and complexity without significant loss of information. Furthermore, we normalized each 2D image into the [0, 1] interval.
• Stage 1: The first stage performs the infection identification task, which aims to find slices with evidence of infection (caused by CAP or COVID-19) for each patient. The identified slices will then be classified into one of the three target classes in the second stage. The input of Stage 1 is the normalized lung area as a 2D image and the output is the label indicating whether the input image demonstrates infection or not. The classification model used in this stage is based on the Capsule Networks (CapsNets) [32], which have shown a superior discriminative capability compared to their CNN-based counterparts, especially when they are trained over small datasets [33][34][35][36]. Each capsule layer consists of multiple capsules, which are groups of neurons represented by a vector. Capsule Network benefits from an iterative process, known as the "Routing by Agreement", that aims to evaluate the agreement between the capsules in a lower layer on the existence of an object in the higher layer. Using the Routing by Agreement process, the model can recognize the relation between multiple instances in an image. Furthermore, CapsNets have lower time and space complexity compared to conventional CNNs [14]. Such advantages make CapsNet-based models ideal in the case of COVID-19 where small annotated datasets are available and disease manifestations show specific spatial distributions in the lung. The detailed structure of the classification model in the first stage is shown in Fig 3(a). For the first stage, we adopted the same architecture as the model proposed in [14]. More specifically, the model in this stage uses a stack of four convolution layers, one batch normalization layer, and one max pooling layer to generate initial feature maps. Next, the output of the last convolution layer is reshaped to form the first capsule layer, followed by three consecutive capsule layers, as shown in Fig 3(a). 
The last layer contains two capsules representing the two target classes (i.e., slices with and without the evidence of infection). The length of each capsule represents the probability of the corresponding class being present. Different from COVID-FACT, residual connections are added between the convolution layers to transfer low-level features to the deeper layers. This modification further assists the model in identifying informative features. Additionally, we have added a dropout layer before the capsule layers to mitigate overfitting during training. The labeled subset of the training dataset has been used to train this stage over 100 epochs using the Adam optimizer with a learning rate of $10^{-4}$. To account for the imbalanced number of slices in each class, we have used a weighted loss function to increase the contribution of the minority group (i.e., slices demonstrating infection) to the final loss value and balance the influence of each class. The balanced loss function used to train Stage 1 is given by
$$\mathrm{loss} = w_1 \times \mathrm{loss}_1 + w_2 \times \mathrm{loss}_2, \quad w_1 = \frac{N_2}{N_1 + N_2}, \quad w_2 = \frac{N_1}{N_1 + N_2}, \qquad (1)$$
where $w_1$ and $w_2$ represent the weights corresponding to the loss value calculated for negative and positive samples, respectively. Term $\mathrm{loss}_1$ denotes the loss associated with negative samples, while $\mathrm{loss}_2$ is the loss associated with positive samples. Term $N_1$ represents the number of negative samples, and $N_2$ is the number of positive samples.
• Stage 2: The second stage takes the candidate slices from the previous stage and classifies them into one of the COVID-19, CAP, or normal cases. More specifically, we have used the slices demonstrating infection recognized by the first stage for all of the cases in the train set (with or without slice-level labels) to train a three-way classification model. Stage 2 utilizes a CapsNet architecture similar to the one used in the first stage but with smaller dimensions and three capsules in the last layer to represent three target classes. The architecture of stage two is shown in Fig 3(b). Similar to the first stage, we used a weighted loss function to cope with the imbalanced number of samples in some categories. At this stage, the loss weights associated with normal and CAP classes are set to 5 and the weight for the COVID-19 class is set to 1. Note that as the normal cases are extremely rare at this stage, the weights are set differently compared to those calculated by Eq 1, to maintain the stability of the training process, while enforcing the model to pay more attention to the minority classes. We also used the binary cross-entropy loss function, which translates the three-way classification problem at hand into three binary classification tasks. In fact, the loss value is calculated separately for each binary label associated with a target class (i.e., COVID-19, CAP, normal). Finally, a majority voting mechanism is adopted to transfer slice-level predictions into patient-level ones and determine the final label. It is worth noting that an accurate model in the first stage detects only a few candidate slices from normal cases. We can then apply a thresholding mechanism on the output of the first stage to identify those cases with only a few identified infectious slices in the first stage and label them as normal. We have used a threshold of 3% to specify normal cases immediately after the first stage. 
More specifically, if fewer than 3% of the slices in a volumetric CT scan are classified as infectious, the corresponding CT scan is classified as a normal case. Based on [37], the minimum lung lesion involvement in patients with COVID-19-related CT findings is 4%. In addition, the minimum percentage of slices demonstrating infection in our training dataset is 7%. If the model in Stage 1 misclassifies more than 3% of the slices of a normal case, there is still a chance to classify the slices as normal in the second stage.
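As a rough illustration of the numerical rules described for the two stages, the following Python sketch computes class-balancing loss weights from slice counts and applies the 3% screening rule followed by majority voting. It is a sketch under stated assumptions rather than the released implementation: the function names and label strings are hypothetical, the weights are assumed to be inverse-frequency weights (each class weighted by the other class's share of the samples), and the behaviour on ties in the vote is not specified in the text.

```python
from collections import Counter


def stage1_loss_weights(n_negative, n_positive):
    """Assumed inverse-frequency weights: the rarer class gets the larger weight."""
    total = n_negative + n_positive
    w_negative = n_positive / total  # scales the loss of slices without infection
    w_positive = n_negative / total  # scales the loss of slices with infection
    return w_negative, w_positive


def patient_label(stage1_infected, stage2_labels, normal_threshold=0.03):
    """Turn slice-level outputs into one patient-level label.

    stage1_infected: one boolean per slice (True = Stage 1 flagged infection).
    stage2_labels:   Stage 2 class of each flagged slice.
    """
    if not stage1_infected:
        raise ValueError("the scan contains no slices")
    n_flagged = sum(stage1_infected)
    if len(stage2_labels) != n_flagged:
        raise ValueError("expected one Stage 2 label per flagged slice")
    # Screening rule: too few flagged slices means the scan is labeled normal.
    if n_flagged / len(stage1_infected) < normal_threshold:
        return "Normal"
    # Otherwise, majority vote over the Stage 2 predictions (ties resolved
    # by first occurrence, which is an arbitrary choice of this sketch).
    return Counter(stage2_labels).most_common(1)[0][0]
```

If all labeled slices reported for the training subset (18,416 without and 4,993 with infection) entered the loss, `stage1_loss_weights(18416, 4993)` would give weights of roughly 0.213 and 0.787; the actual values depend on which cases are held out for validation.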
Unsupervised enhancement. Unseen CT scans acquired by different scanners and scanning protocols contain heterogeneous characteristics, leading to lower performance of a pretrained model. To increase the robustness, we take advantage of the extra unlabeled samples that are available via the various test cases and utilize this extra set of CT scans in an unsupervised fashion. In other words, inspired by the ideas from "Active Learning" [38][39][40], where different data samples are extracted to train the model in different stages, and "Semi-Supervised Learning" [41,42], where a label is assigned to unlabelled cases based on a pre-defined metric, we developed an autonomous mechanism to extract and label a part of the data in the test sets using a probabilistic selection criterion with reduced complexity. The selected samples and the assigned labels are then used to re-train and boost the initially trained model. More specifically, we selected those test cases for which the model generated the most confident results (i.e., high probability). Similarly, among the selected cases, those with high confidence in slice-level predictions are used. To define the confident results, the probability of a volumetric CT scan belonging to a specific target class is considered to be equal to the ratio of the slices belonging to that class over the total number of slices (all slices containing the lung lesion), which can be written as follows
$$P(X \in C_i) = \frac{n_{C_i}}{\sum_{j=1}^{C} n_{C_j}},$$
where $X$ represents the input volumetric CT scan, $C$ represents the number of target classes, and $n_{C_i}$ denotes the number of slices belonging to the target class $C_i$. Then, we introduced a confidence threshold value and considered a prediction confident if the probability of the input CT scan belonging to any of the target classes is more than the pre-set threshold. In this study, we have used 80% as the confidence threshold. A similar approach is used to extract confident slices and their corresponding labels.
In this case, the probability of a slice belonging to a target class is determined by the output of the CapsNet classifier in Stage 2, which is the length ($L_2$ norm) of the capsules in the last layer. It is worth mentioning that for the normal cases that are identified in the first stage using the described thresholding mechanism, we only select the slices that are misclassified as infectious with a high probability (e.g., more than the confidence threshold). Such slices will be labeled as normal in the enhancement phase. Following the aforementioned steps, we can obtain a set of slices and their corresponding labels to augment the training dataset, aiming to make the model more aware of the new features available in the unseen datasets and to achieve more robust feature maps. Therefore, for each test set, we obtained a set of confident slices and their associated labels, which have been added to the train set to re-train the model of the second stage. It is worth noting that the first stage has been kept unchanged in this approach. Finally, after re-training the benchmark model based on the confident slices acquired from each test set, we have obtained several enhanced models (each related to one test set) and averaged the associated patient-level probability scores to achieve the final prediction. This aggregation mechanism depends on the target test set. More specifically, to apply the model on each test set, we take the average of the predictions obtained by the models enhanced over the other test sets. For instance, the model developed for the diagnosis of cases in Test set 1 takes the average of the probability scores provided by the models enhanced on Test sets 2 and 3. The main reason for using such an aggregation mechanism is that the enhancement based on a specific test set will further boost the probability scores of confidently predicted slices while having limited influence on other cases in the same set.
As such, incorporating the model enhanced on a test set will not bring any further improvement to the evaluation process of the same set. The results presented in Table 4 further support this discussion. It is worth noting that we used the first three test sets to enhance the benchmark model and kept the fourth test set aside for only evaluation purposes. As such, upon receiving new test datasets, we can aggregate the results of the enhanced models on the individual test sets (each representing a specific center or scanning protocol) to provide the classification results for the new cases. The unsupervised model enhancement described above along with the subsequent ensemble averaging make the entire framework a robust automated framework that can be easily improved and updated upon receiving new datasets from different imaging centers.
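The selection and aggregation steps described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: all function and variable names are hypothetical, each confident slice is labeled with its own most probable class (the text does not specify whether the slice label must agree with the patient-level label), and the special handling of normal cases identified in the first stage is omitted.

```python
from collections import Counter

CONF_THRESHOLD = 0.80  # confidence threshold reported in the text


def scan_probabilities(slice_labels, classes):
    """P(X in C_i) = n_{C_i} / (total number of slices passed to Stage 2)."""
    total = len(slice_labels)
    if total == 0:
        raise ValueError("no slices were passed to the second stage")
    counts = Counter(slice_labels)
    return {c: counts[c] / total for c in classes}


def confident_scan_label(slice_labels, classes, threshold=CONF_THRESHOLD):
    """Return the predicted class if its probability exceeds the threshold,
    otherwise None (the scan is not used for enhancement)."""
    probs = scan_probabilities(slice_labels, classes)
    best = max(probs, key=probs.get)
    return best if probs[best] > threshold else None


def confident_slices(slice_scores, threshold=CONF_THRESHOLD):
    """slice_scores: one dict per slice mapping class -> capsule length.
    Assumption: a slice is kept, with its own top class as label, when that
    class score exceeds the threshold."""
    selected = []
    for index, scores in enumerate(slice_scores):
        label = max(scores, key=scores.get)
        if scores[label] > threshold:
            selected.append((index, label))
    return selected


def leave_one_out_average(scores_by_model, target_set):
    """Average patient-level scores over the models enhanced on every test
    set except the one being evaluated."""
    kept = [s for name, s in scores_by_model.items() if name != target_set]
    if not kept:
        raise ValueError("no enhanced model is left after exclusion")
    return {c: sum(s[c] for s in kept) / len(kept) for c in kept[0]}
```

For a case from Test set 1, `leave_one_out_average` would receive the scores of the models enhanced on Test sets 1, 2, and 3 and average only the latter two, mirroring the example given above.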

Results
Our proposed framework adopts a two-stage architecture based on Capsule Networks (CapsNets) [32], as shown in Fig 4, which is fed by a volumetric CT scan and provides the probability of the input scan belonging to one of the three target classes. In brief, the first stage identifies CT slices demonstrating infection and passes them to the second stage to be classified as one of the target classes. The output of the first stage is also used to filter normal cases, by applying a 3% threshold on the involvement of the lung parenchyma (i.e., the ratio of the infectious slices in the whole volume). In addition to the proposed framework, four partially enhanced models are developed (based on the four test sets), and the final model aggregates the outputs of the partially enhanced models to provide the final predictions. The proposed enhancement approach extracts confidently predicted images from each test set in an unsupervised fashion, which are then used to update the model's parameters. The final structure of the trained model is selected by minimizing the specified loss function through an optimization process applied on the validation data. In order to investigate the effect of residual and dropout layers on the model's performance, we compared the model's loss and accuracy in the presence and absence of these layers in both Stages 1 and 2. First, the residual connection and dropout layers were excluded from Stage 1, which increased the model's loss from 0.1770 to 0.1859 and decreased the accuracy from 93.04% to 92.84%. Since the output of Stage 2 is dependent on the output of Stage 1, we compared the output of Stage 2 once without residual connection and dropout layers and once with all layers present in the whole process.
Our results indicate that the presence of these layers helps the model provide a more accurate approximation of the output data, as the accuracy improved from 82.87% to 83.60% while the model's loss decreased from 0.2989 to 0.2849. As a final note, it is worth mentioning that the model's lower accuracy on the validation data, compared to the test data, essentially illustrates the positive impact of the proposed unsupervised enhancement approach. To evaluate the performance of the proposed model and the effectiveness of its unsupervised enhancement approach, we used the first three test sets to enhance the benchmark model and kept the fourth test set aside only for evaluation purposes. Accuracy, sensitivity, and Area Under the Receiver Operating Characteristics (ROC) Curve (AUC) are the performance metrics utilized in this study. Accuracy is calculated as the ratio of correctly classified cases to the total number of cases and demonstrates the overall performance of the model. As we are dealing with a multiclass problem, however, sensitivity (also known as the True Positive Rate (TPR)) is calculated for each class independently and defined as the ratio of correctly classified positive cases (true-positives) among all actual positive cases (true-positives and false-negatives). The last metric is the AUC (micro), which is calculated based on the micro-average of the TPR and False Positive Rate (FPR) values obtained for each class at different thresholds. The FPR is defined as the ratio of wrongly classified positive cases (false-positives) among all actual negative cases (false-positives and true-negatives). The reason for using the micro-averaging technique is that we have an imbalanced dataset.
This technique aggregates the contribution of the three classes to compute the average of TPR and FPR as follows:

$$\text{Micro Average of TPR} = \frac{\sum_{i=1}^{3} TP_i}{\sum_{i=1}^{3} \left(TP_i + FN_i\right)}, \qquad \text{Micro Average of FPR} = \frac{\sum_{i=1}^{3} FP_i}{\sum_{i=1}^{3} \left(FP_i + TN_i\right)},$$

where TP, FN, FP, and TN stand for the number of True Positive, False Negative, False Positive, and True Negative cases, respectively, and the index $i$ runs over the three target classes. The results obtained by applying the enhanced ensemble model on all test sets are shown in Table 5. In addition, to further validate the obtained results, confidence intervals for the total accuracy and sensitivity are provided using the method introduced in [43].
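As an illustration of the micro-averaging described above, the following Python sketch pools the one-vs-rest confusion counts of all classes before forming the two ratios. It is a simplified example that works on hard class labels at a single operating point, whereas the AUC reported in this study is obtained by sweeping thresholds over probability scores; the function name and the toy labels are hypothetical.

```python
def micro_tpr_fpr(y_true, y_pred, classes):
    """Pool one-vs-rest TP, FN, FP, and TN over all classes, then compute
    the micro-averaged TPR and FPR."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and aligned")
    tp = fn = fp = tn = 0
    for c in classes:
        for truth, pred in zip(y_true, y_pred):
            if truth == c and pred == c:
                tp += 1  # class c correctly detected
            elif truth == c:
                fn += 1  # class c missed
            elif pred == c:
                fp += 1  # another class wrongly labeled as c
            else:
                tn += 1  # another class correctly not labeled as c
    return tp / (tp + fn), fp / (fp + tn)


# Toy example with four cases (labels are illustrative only).
y_true = ["COVID-19", "COVID-19", "CAP", "Normal"]
y_pred = ["COVID-19", "CAP", "CAP", "Normal"]
tpr, fpr = micro_tpr_fpr(y_true, y_pred, ["COVID-19", "CAP", "Normal"])
print(tpr, fpr)  # 0.75 0.125
```

In a single-label multiclass setting, the micro-averaged TPR computed this way coincides with the overall accuracy, since every misclassified case contributes exactly one false negative.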
To elaborate on the effect of the proposed unsupervised enhancement approach, we have provided the performance of the benchmark model (i.e., before enhancement) as well as the models enhanced by individual test sets (i.e., before averaging the outputs) in Table 4. Results shown in Table 4 imply that, in some misclassified cases, the probability of the input CT scan belonging to the target class has been on the thresholding edge (close to 0.5) and could be corrected after incorporating the models enhanced over other test sets.
In addition to the final patient-level predictions, we have evaluated the performance of the first stage on the validation set in detecting slices demonstrating infection to have a clearer insight into the internal components of the framework. The first stage achieved an accuracy of 93.41%, sensitivity of 91.04%, and specificity of 94.26% in the binary (infectious & non-infectious) classification task. As slice-level labels (i.e., binary labels indicating the existence of infection in a CT slice) are not available for test sets, the result on the validation set is only reported. Moreover, as mentioned earlier, the output of the first stage can be used to identify most normal cases before entering the next stage. We found that nearly all of the normal cases in the four test sets (45/46 cases) have been identified correctly by the thresholding mechanism applied on the output of the first stage, while none of the COVID-19 and CAP cases have been misclassified as normal using this thresholding approach.
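The thresholding mechanism used to filter normal cases can be sketched as follows. This is a minimal illustration, assuming that the first stage returns one binary flag per slice and that a case is treated as normal when the ratio of infectious slices is strictly below the threshold (whether the comparison is strict is not stated in the text); the names are hypothetical.

```python
INVOLVEMENT_THRESHOLD = 0.03  # 3% threshold described in the text


def involvement_ratio(slice_is_infectious):
    """Ratio of slices flagged as infectious by the first stage."""
    n_slices = len(slice_is_infectious)
    if n_slices == 0:
        raise ValueError("the CT volume contains no slices")
    return sum(1 for flag in slice_is_infectious if flag) / n_slices


def is_normal_case(slice_is_infectious, threshold=INVOLVEMENT_THRESHOLD):
    """Assumption: normal when involvement is strictly below the threshold."""
    return involvement_ratio(slice_is_infectious) < threshold


# A 100-slice volume with 2 flagged slices is filtered as normal,
# while 10 flagged slices send the case on to the second stage.
print(is_normal_case([True] * 2 + [False] * 98))   # True
print(is_normal_case([True] * 10 + [False] * 90))  # False
```

With a fixed 3% threshold, a volume of fewer than 100 slices tolerates at most two wrongly flagged slices, which is why scans with few slices are more sensitive to first-stage errors.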
In Fig 5, the ROC curves for COVID-19 and CAP cases against other classes (e.g., COVID-19 vs. CAP and Normal) are plotted. The associated AUC values are also provided.

Comparison
We have compared our proposed framework with the top six models [7][8][9][10][11][12] developed in the SPGC-COVID challenge. The Methods section of this paper provides a detailed description of each model, along with an overview of the development and evaluation steps of the challenge. In addition to the models proposed in the challenge, we have further compared our proposed framework with another model, which utilizes the same train and test sets (excluding the 4th test set) to target the same classification task [20]. A brief description of this model is also provided in the Methods section. Experimental results demonstrate that our proposed framework outperforms its counterparts proposed in the SPGC-COVID challenge. Furthermore, it benefits from a scalable enhancement approach that can be integrated into most of the state-of-the-art models to improve their performance when testing on a heterogeneous dataset.
With regard to the development and evaluation process of the SPGC-COVID challenge, it is worth mentioning that the proposed framework and those models whose results are used as a comparison in this study have been developed in an entirely similar fashion. More specifically, the developers had access to the same datasets and labels, even the initial train/validation split of the data was the same. Moreover, no specific restrictions were applied to the development process to prevent privileging a specific type of model. In other words, the proposed framework was not designed to provide a baseline or reference standard for comparison in the challenge. In addition, we would like to highlight that even the proposed benchmark model without the incorporation of the enhancement approach achieved a higher performance than other models, as shown in Table 6. Therefore, given that the same scenario has been in place for all the developers (including us), we believe that the results present a fair and reliable comparison. The performance of the investigated models is presented in Table 6.  Table 6 illustrates the performance of seven automated models developed to tackle the same task as that of this study using the same train and test datasets. We have also compared the overall performance of our proposed framework with the aforementioned models using the statistical McNemar's test [44] with the significance level of 0.05. We tested the hypothesis that the models have the same proportion of errors on the entire test sets. The corresponding p-values are reported in Table 6 and indicate that the hypothesis is rejected for almost all the models except the first one as the corresponding p-value is slightly more than 0.05. In other words, there is a significant difference in the proportion of errors between our proposed framework and six of the aforementioned models while such difference is not significant in the case of the model proposed in Ref. [7].
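For readers unfamiliar with McNemar's test, the sketch below computes a two-sided p-value from the per-case correctness of two models evaluated on the same cases. The text does not state which variant of the test was used, so the exact binomial form shown here is an assumption, and the function name and toy data are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-case correctness flags.
    Only discordant pairs (one model right, the other wrong) matter."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be evaluated on the same cases")
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy example: model A is right on 5 cases that model B gets wrong,
# and both are right on 3 further cases.
model_a = [True] * 8
model_b = [False] * 5 + [True] * 3
print(mcnemar_exact(model_a, model_b))  # 0.0625
```

In this toy example the p-value of 0.0625 is slightly above the 0.05 significance level, so the hypothesis of equal error proportions would not be rejected.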
Based on the key components of the models provided in Table 3 and the results reported in Tables 6 and 7, we can conclude that using an advanced lung region extraction model such as the U-Net R231 COVIDweb can improve the performance. Moreover, pre-training and data augmentation are used in most of the models, although such techniques were not utilized in our proposed CapsNets-based framework, which demonstrates the capability of Capsules to be trained using small datasets with limited data augmentation or pre-training. Furthermore, most models have not taken any specific measures to tackle the heterogeneity in the test cases. Although Ref. [20] used an online unsupervised learning approach to target this issue, it was not adequately trained and designed, in our opinion. This could possibly have led to its low performance. In addition to the components outlined in Table 3, we would like to note that the best-performing model in the challenge (i.e., Ref. [7]) used a customized mechanism to focus on middle slices showing large visible lung areas, which could possibly improve its capability to perform well on various cases. However, their approach is static and needs to be adapted depending on slice thicknesses to provide a dynamic slice selection.

Discussion
In this paper, we expanded the fully-automated framework developed in our previous study [14] to tackle the three-way classification task (i.e., identification of COVID-19, CAP, and Normal cases) based on volumetric CT scans acquired from multiple centers using different imaging protocols. We also proposed an unsupervised enhancement approach, which can enable all deep learning-based frameworks to be adapted to the heterogeneity in different test sets. In Table 8, the numbers of slices extracted from each test set to augment the train set are presented. The low number of normal slices demonstrates the high performance of the first stage in identifying slices with and without evidence of infection. As another advantage of the proposed framework, we can mention the capability of the Capsule Network-based model to be trained using a relatively small dataset, which is of utmost importance in the field of Medical Image Processing, in particular the COVID-19 disease, where, typically, small annotated datasets are available. The other noteworthy advantage is that the model does not require any infection annotation, which is a challenging and time-consuming task. The only segmentation used in our study is the lung area segmentation (i.e., extracting the lung parenchyma using a pretrained U-Net model [31]), which is a well-studied task and does not add much complexity to the model. We would like to highlight the effect of the suggested 3% threshold used to identify normal cases based on the outcome of the first stage. As mentioned earlier, 3% is a safe threshold to identify normal cases as it is extremely rare to observe less than 3% involvement of the lung parenchyma in COVID-19 cases. However, it is possible that the number of slices identified as infectious in a normal case exceeds this 3% threshold. This could happen mainly in those CT scans with a large slice-thickness and fewer slices (e.g., less than 100 slices). 
In such cases, a minor error (a few misclassified slices by the first stage) will mistakenly indicate a large involvement of the lung parenchyma. Such errors can be avoided by increasing the 3% threshold or using an adaptive threshold (e.g., based on the slice-thickness and number of slices) when we are dealing with fewer slices per patient. In this study, only one normal case has been misclassified, and increasing the threshold to 6% could remove the error while the other cases were not affected. The promising results and benefits of the first stage in identifying slices demonstrating infection indicate its significant potential to be used in other CT scan-related models to help identify normal cases and concentrate only on a subset of slices rather than the whole volume. Furthermore, we would like to highlight that the results shown in Table 4 demonstrate that a model enhanced based on a test set is unable to improve the performance on the same set. This is mainly because the additional data used to update the benchmark model is constructed from the cases with the highest probability scores (whether correct or not), and incorporating them into the train set will force the model to further increase the corresponding probability scores while having little effect on other slices. As such, in the test phase, it is more reasonable to aggregate the outputs obtained by all enhanced models except the one associated with the target test set. It is also worth mentioning that due to the nature of the data (i.e., Medical Images), obtaining a large and diversified dataset from different countries is challenging. However, we will continue to expand the diversity of the dataset to perform more comprehensive investigations on the generalizability of our proposed framework on other test sets, as well as to determine the maximum level of the shift in image characteristics that can be compensated using our proposed framework.
Finally, it is worth noting that it is possible to design more advanced techniques to select the cases and images from the new test sets using the metrics introduced in the field of Active Learning [38,39] through which the cases which bring more diversity to the training set and the associated feature maps are detected and used for training purposes. In addition to the enhancement techniques in the field of Active Learning, there have been recently several studies on using Generative Adversarial Networks (GANs) to cope with the data and domain shift in medical images [45,46] where the labeled data is not available in the target domain. The main goal in such frameworks is to achieve a domain invariant image representation which can efficiently embed the important features of the image regardless of the imaging modality or imaging technique. Similarly in [47], an auto-encoder and feature augmentation-based approach is proposed to adapt the model with various imaging modalities obtained by different scanners. However, in this study, we are dealing with only one imaging modality (i.e., CT scan) and the level of characteristic shift between the images is lower compared to the images investigated in the aforementioned studies. Moreover, we could achieve high performances using a far less complicated mechanism.
In summary, we have proposed an approach to update the model's parameters by extracting confident predictions from the test sets and utilizing them to re-train the model in order to increase its capability and robustness in the presence of gaps between the imaging protocols and patients' clinical history. We showed that we can train different versions of the model based on different test sets and combine their outputs to generate the final predictions, which are more accurate and robust.
As a final note to our discussion, a technical review on the current clinical significance of chest imaging, in particular CT scans, in COVID-19 diagnosis during the pandemic is provided. Furthermore, we provide an overview of new techniques introduced to minimize the associated cumulative radiation dose imposed on the body to address the concerns raised around the potential risks of CT imaging.

Clinical significance of chest imaging and CT in screening of COVID-19
The role of chest imaging, especially CT, has evolved during the pandemic following the accumulation of experience and scientific data. The performance of chest imaging has been debated since the early period of the pandemic, when early studies from China showed the superiority of CT over the RT-PCR test [48,49]. This may be attributable to variabilities in viral load depending on the disease stage and sampling error, as well as the low availability and high demand of the test in the early days of the pandemic [50]. In this regard, some studies have recommended parallel testing using CT and RT-PCR, especially when consecutive tests are required to confirm the infection for treatment planning [51][52][53]. Currently, the RT-PCR test is the most commonly used diagnostic tool for COVID-19 detection, and some scientific societies do not recommend the use of chest CT for COVID-19 screening. On the other hand, chest imaging, specifically CT scanning, has a crucial role in different healthcare environments and clinical scenarios [54]. In resource-abundant environments, chest CT is not recommended for asymptomatic or mildly symptomatic patients without accompanying risk factors, unless they are at risk for disease progression. It is, however, recommended for medical triage of patients with suspected COVID-19 who present with moderate to severe symptoms and a high pretest probability of the disease regardless of the RT-PCR test result, or in resource-constrained environments where RT-PCR tests may not be readily available or test results might be delayed. Chest CT imaging is also recommended for patients with worsening respiratory status.
As such, it is clear that despite the widespread use of the RT-PCR test for screening of COVID-19 infection, the role of chest CT is irreplaceable in providing a baseline for future comparison, revealing an alternative diagnosis, establishing manifestations of important comorbidities in patients with risk factors for disease progression, and influencing treatment strategy and the intensity of monitoring for clinical deterioration [48].
In addition, it should be noted that the RT-PCR test is accompanied by a high false negative rate. Based on the latest guidance provided by the World Health Organization (WHO) for critical preparedness, readiness, and response actions for COVID-19 [55], such errors most likely occur due to technical reasons such as new virus mutations, the inhibition of the PCR reaction, sampling or storage errors, and timing of the test (i.e., too early or too late examinations). It is also worth noting that although a consecutive RT-PCR test could reduce the error probability, several studies [56,57] have shown that such false negative results are not limited to the first test, and multiple negative PCR results have been reported for many cases while the significant progression of the disease has been confirmed through their CT scans.
In summary, the RT-PCR test is currently considered the most reliable tool for screening for COVID-19. As stated above, there are certain clinical scenarios, which can include a significant number of cases, where the role of CT in diagnosing COVID-19 infection is irreplaceable. These scenarios include: (i) Suspected false negative PCR test, i.e., moderately or severely symptomatic patients with a high pretest probability of the disease regardless of the RT-PCR test; (ii) Unavailability of PCR test, i.e., resource-constrained environments, where the RT-PCR test may not be available, and; (iii) Latency, i.e., the rapidity with which the information is provided with chest CT compared to an RT-PCR test makes it the preferred method of diagnosis in certain environments. In conclusion, although we believe that there is a critical role for chest CT in correctly identifying COVID-19 infection in addition to being a critical approach for the detection of complications and prognosis, there are concerns about the limited clinical significance of chest CT as a tool for diagnosis of COVID pneumonia.

Radiation dose reduction in CT imaging
As the next part of our discussion, we would like to address the concerns raised around the potential risks of radiation exposure caused by CT imaging and its side-effects on the patients' bodies. In particular, we would like to highlight the recent recommendations in the utilization of low-dose and ultra-low-dose CT imaging protocols, which ensure that minimized radiation is imposed on the body while the images are still of high quality and could reveal specific radiologic findings. Several studies have introduced specific technical settings that result in a significant reduction in the associated CT dose index (CTDIvol). As an example of such studies, Reference [58] reported adequate assessment of pulmonary opacities related to COVID-19 pneumonia at 100 kV with tin filter and iterative reconstruction technique with a CTDIvol of 0.4 mGy versus the standard-dose protocol with the CTDIvol of 3.4 mGy. Another study [58] applied 100 kV with tin filter and 0.6 second exposure time using a high pitch and fast gantry rotation time to acquire chest CT images at 0.6 mGy CTDIvol, which were comparable to standard-dose chest CT at 6.4 mGy. Furthermore, we would like to highlight that the low-dose scanning protocol used to acquire the CT scans in test sets 1 and 4 of this study has resulted in a significant reduction in the average radiation dose by reducing the reference mAs value from 50 to 15-20. More specifically, the radiation dose was reduced from the estimated value of 7 mSv in standard-dose scans to 1-1.5 mSv in LDCT scans, and 0.3 mSv in the ULDCT ones, which is as low as that of a single chest radiograph. The results obtained for the aforementioned test sets demonstrate the effectiveness of such scans in providing specific CT findings that are efficiently captured by the proposed framework in an automated fashion.
In conclusion, the concerns around cumulative radiation exposure on patients' bodies could be mitigated to some extent, especially during a pandemic when a larger population is in need of being scanned.
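To put the dose figures quoted above into perspective, the relative reductions can be computed directly. The short Python sketch below only restates the values given in the text as percentages; the helper name is hypothetical and the percentages are derived quantities, not results reported in the study.

```python
def percent_reduction(reference, reduced):
    """Relative reduction of `reduced` with respect to `reference`, in %."""
    if reference <= 0:
        raise ValueError("the reference dose must be positive")
    return 100.0 * (reference - reduced) / reference


# Effective dose: 7 mSv (standard) vs. 1-1.5 mSv (LDCT) and 0.3 mSv (ULDCT).
print(round(percent_reduction(7.0, 1.5), 1))  # 78.6
print(round(percent_reduction(7.0, 1.0), 1))  # 85.7
print(round(percent_reduction(7.0, 0.3), 1))  # 95.7

# CTDIvol examples quoted from the literature: 3.4 -> 0.4 mGy, 6.4 -> 0.6 mGy.
print(round(percent_reduction(3.4, 0.4), 1))  # 88.2
print(round(percent_reduction(6.4, 0.6), 1))  # 90.6
```

Under these figures, the LDCT protocol corresponds to roughly a 79% to 86% lower effective dose than the standard-dose protocol, and the ULDCT protocol to roughly a 96% lower dose.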